Automatically Generating Textual Descriptions of Images: Image Captioning
Abstract
The advancement of multimodal learning has attracted growing interest in the computer vision and natural language processing communities because of its many potential applications, from image retrieval to assistive technologies to content generation. This abstract proposes a deep learning-based image caption generator. The system gives an overview of the problems and the state of the art in image captioning, with particular attention to models built on the deep encoder-decoder architecture. It also reviews widely used evaluation metrics and datasets and discusses the pros and cons of different methods. In short, the system showcases state-of-the-art image captioning and how captions are generated for an image using deep learning concepts.
Introduction
1. Overview
The project explores image captioning, a field in computer vision where images are automatically described using natural language. It combines Convolutional Neural Networks (CNNs) for feature extraction and Long Short-Term Memory (LSTM) networks for sequential text generation. This technology is particularly designed to assist visually impaired individuals by converting images into descriptive text and audio using Text-to-Speech (TTS) engines.
2. Core Concept
Image Captioning: The system analyzes an image and generates a meaningful, sentence-level description.
Assistive Application: Designed for visually impaired users, it uses deep learning to translate visual content into speech, enhancing independence and interaction with digital media.
3. Methodology
A. Key Technologies Used
CNN: Extracts visual features such as shapes, colors, and objects.
LSTM: Processes extracted features to generate grammatically and contextually coherent captions.
TTS Engine: Converts text captions into voice output.
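As an illustration of the speech-output component, the following is a minimal sketch that reads a caption aloud. It assumes the open-source pyttsx3 offline TTS library, which the design does not name explicitly and is used here only as one possible choice; the example caption string is hypothetical.

```python
# Minimal TTS sketch, assuming the pyttsx3 offline text-to-speech library
# (pip install pyttsx3). The caption below is a hypothetical example.
import pyttsx3


def speak_caption(caption: str) -> None:
    """Read a generated caption aloud using the system's default voice."""
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # moderate speaking rate for clarity
    engine.say(caption)
    engine.runAndWait()               # block until speech finishes


if __name__ == "__main__":
    speak_caption("a dog is running across a grassy field")
```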
B. Process Flow
Image Preprocessing: Images are resized to a fixed shape and their pixel values are normalized (see the sketch after this list).
Feature Extraction: CNN detects and encodes image components.
Sequence Modeling: LSTM interprets features and generates text.
Speech Output: Captions are spoken aloud using TTS for accessibility.
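The preprocessing stage of this flow can be sketched as below, assuming a Keras InceptionV3-style backbone that expects 299x299 inputs; the backbone choice and the image file path are illustrative assumptions rather than part of the original design.

```python
# Minimal image-preprocessing sketch, assuming Keras with an InceptionV3-style
# CNN backbone (299x299 input). The file path is a hypothetical placeholder.
import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.inception_v3 import preprocess_input


def load_and_preprocess(path: str) -> np.ndarray:
    """Resize an image and scale pixel values the way the CNN expects."""
    img = image.load_img(path, target_size=(299, 299))  # standardize size
    arr = image.img_to_array(img)                        # HWC float array
    arr = np.expand_dims(arr, axis=0)                    # add batch axis
    return preprocess_input(arr)                         # scale to [-1, 1]


batch = load_and_preprocess("example.jpg")  # shape: (1, 299, 299, 3)
```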
4. Literature Survey
Several studies have been reviewed:
Amritkar et al.: Combined CNN and RNN for generating human-like captions.
Agrawal et al.: Introduced attention mechanisms to improve detail and focus.
Sailaja et al.: Demonstrated CNN-LSTM effectiveness in real-world assistive applications.
Sharma et al.: Compared various RNN models for accuracy in captioning.
Mathur et al.: Developed lightweight models for real-time captioning on low-end devices.
These studies collectively highlight the effectiveness of deep learning in accessibility and visual-linguistic tasks.
5. Proposed System Design
A. Architecture Components
Dataset: Uses datasets such as Flickr8k, which pairs each image with several human-written captions.
Preprocessing: Captions are cleaned and tokenized, and images are resized and normalized (a text-preprocessing sketch follows this list).
CNN: Identifies key visual elements.
LSTM: Generates sequential language based on CNN inputs.
TTS: Converts text into audio for user output.
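A minimal sketch of the caption cleaning and tokenization step, assuming Flickr8k-style annotations and the Keras Tokenizer; the two sample captions are hypothetical stand-ins for the real annotation file.

```python
# Minimal caption cleaning/tokenization sketch, assuming Flickr8k-style
# captions and the Keras Tokenizer. The sample captions are hypothetical.
import re
from tensorflow.keras.preprocessing.text import Tokenizer

raw_captions = [
    "A child in a pink dress is climbing up a set of stairs.",
    "Two dogs are playing in the snow .",
]


def clean(text: str) -> str:
    """Lowercase, strip punctuation and digits, and add start/end markers."""
    text = text.lower()
    text = re.sub(r"[^a-z ]", "", text)
    text = " ".join(text.split())          # collapse extra whitespace
    return f"startseq {text} endseq"


cleaned = [clean(c) for c in raw_captions]

tokenizer = Tokenizer()                    # builds a word -> integer index map
tokenizer.fit_on_texts(cleaned)
sequences = tokenizer.texts_to_sequences(cleaned)
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size, sequences[0])
```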
B. System Diagrams
Data Flow Diagram (DFD): Shows the step-by-step transformation from image input to speech output.
Use Case Diagram: Highlights user actions like uploading images and receiving captions.
6. Algorithms Explained
A. CNN (Convolutional Neural Network)
Detects patterns and features in the image.
Uses layers (convolution, activation, pooling, fully connected) to generate a feature vector.
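A minimal sketch of this feature-extraction idea, assuming a pretrained InceptionV3 encoder from Keras with its final classification layer removed so that it emits a 2048-dimensional feature vector; the specific backbone is an assumption (the design does not name one), and the input batch is a random placeholder standing in for a preprocessed image.

```python
# Minimal CNN feature-extraction sketch, assuming a pretrained InceptionV3
# encoder from Keras. The input batch is a random placeholder for a
# preprocessed 299x299 RGB image.
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.models import Model

base = InceptionV3(weights="imagenet")              # ImageNet-pretrained CNN
# Drop the final softmax layer; keep the pooled 2048-d feature vector.
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

batch = np.random.rand(1, 299, 299, 3).astype("float32")  # placeholder image
features = encoder.predict(batch)
print(features.shape)                                # (1, 2048)
```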
B. LSTM (Long Short-Term Memory)
Handles long-term dependencies in sequences.
Uses gates (input, forget, output) for managing memory and generating fluent captions.
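A minimal sketch of the decoder side, assuming the widely used "merge" CNN-LSTM layout for Flickr8k-style captioning: a 2048-d image feature vector and a partial caption are combined to predict the next word. The vocabulary size, maximum caption length, and layer widths are hypothetical values, not figures from this design.

```python
# Minimal LSTM decoder sketch, assuming the common "merge" encoder-decoder
# layout: CNN features + partial caption -> next-word distribution.
# vocab_size, max_len, and layer widths are hypothetical values.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 5000, 34

# Image branch: project the 2048-d CNN feature vector into the decoder space.
img_in = Input(shape=(2048,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the partial caption and run it through an LSTM.
seq_in = Input(shape=(max_len,))
seq_emb = Embedding(vocab_size, 256, mask_zero=True)(seq_in)
seq_vec = LSTM(256)(Dropout(0.5)(seq_emb))

# Merge both branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_vec, seq_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

decoder = Model(inputs=[img_in, seq_in], outputs=out)
decoder.compile(loss="categorical_crossentropy", optimizer="adam")
decoder.summary()
```

At inference time such a decoder is called repeatedly, feeding each predicted word back into the sequence input until an end-of-caption token is produced (greedy decoding).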
7. System Evaluation & Results
Performance: The system generates coherent and relevant captions across various images.
Accuracy: Generated captions align closely with image content; this overlap with reference captions is typically quantified with metrics such as BLEU (see the sketch after this list).
Efficiency: Can be adapted for low-end devices with minimal loss in performance.
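Caption quality of this kind is commonly quantified with BLEU; the following sketch scores one hypothetical candidate caption against two hypothetical reference captions using NLTK, and is an illustration of the metric rather than a result from this system.

```python
# Minimal BLEU evaluation sketch, assuming NLTK is installed. The reference
# and candidate captions are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a dog runs across the grass".split(),
    "a brown dog is running on a field".split(),
]
candidate = "a dog is running on the grass".split()

smooth = SmoothingFunction().method1   # avoid zero scores on short captions
bleu1 = sentence_bleu(references, candidate,
                      weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu4 = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")
```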
8. Applications
Accessibility Tools: For the blind and visually impaired in education, healthcare, and public media.
Smart Assistants: Voice-enabled systems to describe visual surroundings.
Content Tagging: Automatic labeling of images for organization and search.
Conclusion
In summary, the speech-based image caption generator is a promising solution for improving accessibility and participation for visually impaired people. The system combines image preprocessing, CNN-based feature extraction, text cleaning and tokenization, and an LSTM-based model to generate textual descriptions of input images. It has many potential users, including teachers, researchers, social media platforms, news and media organizations, and mobile applications. The functional requirements ensure that the system can perform its core tasks, while the non-functional requirements ensure that it is reliable, secure, and user-friendly. By conveying the details of images, the system increases accessibility and enables visually impaired people to better understand the images they encounter. Overall, the proposed approach is an important step towards greater accessibility and participation for visually impaired people; by generating textual descriptions of images, it can have a positive impact on the lives of many people around the world.